Skip to content

BUG: Require sample weights to sum to less than 1 when replace = True #61582

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

microslaw
Copy link

@microslaw microslaw commented Jun 6, 2025

@microslaw microslaw force-pushed the add-df-sample-weight-constraints branch 2 times, most recently from 1281002 to 1d1f742 Compare June 6, 2025 11:15
@mroeschke mroeschke requested a review from rhshadrach June 6, 2025 16:38
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, but the condition needs adjusted. Will also need a note in the whatsnew for 3.0 under the Other section.

Comment on lines 153 to 159
if weight_sum > 1 and replace:
raise ValueError(
"Invalid weights: If `replace`=True weights must sum to less than 1"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct condition. See the formula in the OP of the linked issue.

Comment on lines 5818 to 5819
When replace = True will not allow weights that add up to less
than 1, to avoid biased results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When replace = True will not allow weights that add up to less
than 1, to avoid biased results.
When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``,
to avoid biased results.

@microslaw microslaw force-pushed the add-df-sample-weight-constraints branch 3 times, most recently from 2bc3618 to e5f9a07 Compare June 6, 2025 21:53
@microslaw microslaw marked this pull request as draft June 6, 2025 21:58
@microslaw microslaw force-pushed the add-df-sample-weight-constraints branch 3 times, most recently from 688cabe to 2aacee7 Compare June 11, 2025 23:08
@microslaw microslaw force-pushed the add-df-sample-weight-constraints branch from 4dbeaf3 to 1dfc85d Compare June 12, 2025 17:37
@microslaw microslaw force-pushed the add-df-sample-weight-constraints branch from 9f62a65 to 11df613 Compare June 12, 2025 22:32
@microslaw microslaw marked this pull request as ready for review June 13, 2025 21:07
@microslaw
Copy link
Author

  1. I've added this as an exception, but I've also considered doing this as a warning, since the issue is

    • easy to encounter - one case occured in user_guide
    • hard to explain within a short error message - requires specific equation and some level of understanding of probability
    • has minimal effect on majority of use cases - as long as the variance isn't huge, the bias this error introduces is minimal, and for very specific cases e.g. the village sampling from the op warning should suffice
  2. I also deviated a bit from original issue in that i'm allowing max(w) * n / sum (w) == 1. As far as I understand, it doesn't introduce any bias, and it allows some specific uses, e.g. df.sample(n=1,weights=[1,0])

  3. If this is implemented as an exception, I believe that there is no way for the numpy exception "Fewer non-zero entries in p than size" to occur, as:

    • w - list of weights
    • n - number of non-zero weights in w
    • size - size of the sample to be drawn
    • len - length of w

    I'm assuming that

    • the weights are normalized, so sum(w) = 1, as it doesn't change anything and makes the math easier
    • the weights are equal, since if any of them is larger than the others, max(w) and max(w) * size will increase
    • therefore w = [1/n, 1/n, ..., 1/n] or any combination like w = [1/n, 0, 1/n, 0, ..., 1/n] where n is the number of non-zero weights
    • for the "Fewer non-zero entries in p than size" to occur, size > n
    • for the "Invalid weights: If replace=False, " to occur, size * max(w) / sum(w) >= 1
    • size * max(w) / sum(w) >= 1
    • size * max(w) >= 1, as sum(w) = 1
    • size * 1/n >= 1, as max(w) = 1/n
    • size >= n

@microslaw microslaw requested a review from rhshadrach June 13, 2025 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.sample weights not required to sum to less than 1
3 participants